Abstract

Four methods of handling missing data were applied to missing values for variables selected from 
the National Educational Longitudinal Study of 1988. Variables used were those selected by 
Singh and Ozturk (1999) for a study concerning high school students’ academic achievement and 
work. Samples selected consisted of 100 cases, 300 cases, and 500 cases. The proportion of 
incomplete cases was manipulated to represent 30%, 50%, and 70% for each sample. In addition, 
composite variables were created and tested. Results indicate the EM algorithm and regression 
procedures provide accurate estimates under all conditions. Listwise and pairwise deletion were 
effective with small proportions of missing data and when composites were created. 
Four methods of handling missing data in predicting educational achievement

When data is analyzed in survey research, often there are missing values. If the 
mechanism causing the missing values is known, the solution to this problem may be incorporated 
in the study. Many times, however, the mechanism causing the missing values is not known. 
Ignoring this problem may lead to analysis of data that is of dubious value. 

In addition, different methods of handling missing values may produce different results. 
When Jackson (1968) entered data on all the available variables in a discriminant analysis, the 
significance of the regression coefficients of individual variables, as well as the interpretation of 
the importance of these variables, changed with the missing value method used. Witta and Kaiser 
(1991) also reported that the regression coefficients and total variance accounted for by the 
variables changed depending on the method used to handle missing values. After re-analyzing 
three studies of private/public school achievement, Ward and Clark III (1991) concluded that the 
method used to handle missing data influenced the outcome of these studies. 

In using the National Educational Longitudinal Study database to investigate the effects of 
part-time work on school outcomes, Singh and Ozturk (1999) eliminated more than half of the 
selected cases by listwise deletion of the incomplete data. In addition, composite variables were 
created to help explain the school outcomes. 

Statement of the Problem 

The purpose of the current study was threefold: (a) to investigate the effectiveness of four 
methods of handling missing data using the 26 variables in the Singh and Ozturk (1999) study, (b) 
to compare the effectiveness of the missing data methods after creating composite variables, and 
(c) to compare the effectiveness of each missing data treatment using composite variables to the 
same treatment when using the individual predictor variables. Effectiveness was defined as the 
probability of accurately predicting achievement on standardized tests. Effectiveness of the 
missing data methods was assessed by manipulating the proportion of cases containing missing 
values, the sample size, and the number of variables. The missing data handling methods studied 
were listwise deletion, pairwise deletion, regression, and expectation maximization. Sample sizes 
investigated were 100, 300, and 500. The proportion of incomplete cases in each sample was 
30%, 50%, and 70%. 

Methods Studied 

Listwise Deletion 

Listwise deletion is probably the most frequently used method of handling missing data 
and is available as a default option in several statistical software programs. This method discards 
cases with a missing value on any variable and thus is very wasteful of data. Listwise deletion, 
however, has been shown to be more effective with low average intercorrelation, fewer than four 
variables, and a small proportion of missing values (Chan et al., 1976; Haitovsky, 1968; Timm, 
1970). The assumption of missing completely at random is crucial to the use of this method. It is 
more likely, however, to find the complete sample different in important ways from the 
incomplete sample (Little & Rubin, 1987). Problems for a researcher using this method include a 
reduction in power and an increase in standard error due to reduced sample size and the 
elimination of sub-populations. 
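
As a minimal sketch of the idea (not the SPSS routine used in this study), listwise deletion can be written in a few lines of Python. The variable names and the toy records are hypothetical, and a missing value is assumed to be coded as None.

```python
# Toy records; None marks a missing value (hypothetical data, not NELS:88).
records = [
    {"hours": 10, "late": 2, "math": 52.0},
    {"hours": None, "late": 1, "math": 48.5},
    {"hours": 15, "late": None, "math": 55.1},
    {"hours": 20, "late": 0, "math": 61.3},
]


def listwise_delete(rows):
    """Keep only the cases that have no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(records)
print(len(records), "cases before,", len(complete), "cases after")
```

Half of the toy cases are discarded even though each incomplete case is missing only one value, which illustrates why the method is described as wasteful of data.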

Pairwise Deletion 

When using pairwise deletion, covariances are computed between all pairs of variables 
having both observations, eliminating those that have a missing value for one of the two variables 
(Glasser, 1964). Means and variances are computed on all available observations. The 
assumption made is that the use of the maximum number of pairs and all the individual 
observations yields more valid estimates of the relationship between the variables. It is assumed 
that when two variables are correlated, information on one improves the estimates of the other 
variable. It is also assumed that the pairs are a random subset of the sample pairs. If these 
assumptions are true, pairwise deletion produces unbiased estimates of the variable means and 
variances (Hertel, 1976). When missing data are not missing completely at random, however, the 
correlation matrix produced by pairwise deletion may not be Gramian (Norusis, 1988). 
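
The available-case logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the SPSS implementation: the function names and toy data are hypothetical, missing values are coded as None, and the covariance uses the n - 1 divisor.

```python
def pairwise_cov(x, y):
    """Covariance of x and y from the cases observed on both variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


def available_mean(x):
    """Mean of x from every case observed on x, whatever else is missing."""
    seen = [a for a in x if a is not None]
    return sum(seen) / len(seen)


# Hypothetical scores for five cases.
hours = [10, None, 15, 20, 5]
math_score = [52.0, 48.5, None, 61.0, 47.0]

cov_hm = pairwise_cov(hours, math_score)   # uses 3 of the 5 cases
mean_hours = available_mean(hours)         # uses 4 of the 5 cases
print(cov_hm, mean_hours)
```

Because each entry of the covariance matrix may rest on a different subset of cases, the assembled matrix is not guaranteed to be positive semidefinite, which is the non-Gramian problem noted above.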

Marsh (1998) investigated the estimates produced when using pairwise deletion for 
randomly missing data. From this study, which included five levels of missing data and three 
sample sizes, Marsh concluded parameter variability was explained, parameter estimates were 
unbiased, and only one covariance matrix was nonpositive definite. 

Regression 

Regression as an imputation method has many variations. The variations rely on 
information from other variables to estimate missing values. As the average intercorrelation and 
the number of variables from which these methods can obtain information increases, the 
regression methods, theoretically, perform better. Too many variables, however, can cause 
problems with overprediction (Kaiser & Tracy, 1988) and too high an average intercorrelation 
can result in a singular matrix. In these cases, regression does not perform well. 
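
One of the simplest regression variants can be sketched as below: a single predictor, coefficients estimated from the complete cases, and no added residual or iteration. This is only an illustration; the function name and data are hypothetical, and the predictor x is assumed to be fully observed.

```python
def regression_impute(x, y):
    """Fill missing y values with predictions from y = a + b*x,
    where a and b are estimated from the complete cases only."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    b = sxy / sxx            # slope from the complete cases
    a = mean_y - b * mean_x  # intercept from the complete cases
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]


# Hypothetical data; None marks the missing value to be imputed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
filled = regression_impute(x, y)
print(filled)
```

Every imputed value falls exactly on the fitted line, so a deterministic variant like this one tends to understate the variability of the imputed variable.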

Variations in the regression methods include differences in methods of developing the 
initial correlation matrix (listwise deletion, pairwise deletion, and mean substitution) and the 
presence or absence of iteration procedures. Differences in regression methods also include the 
use of randomly selected residuals for iterations and assumptions of a normal distribution. 
Theoretically, the more variables considered that provide additional information, the better the 
estimate. Mundfrom and Whitcomb (1998) investigated the effects of using mean substitution, 
hot-deck imputation, and regression imputation on classification of cardiac patients. Mean 
substitution and hot-deck imputation correctly classified patients more frequently than regression 
imputation. 

Expectation Maximization 

Dempster, Laird, and Rubin (1977) recommended the use of the EM (expectation 
maximization) algorithm which imputes estimates simultaneously in an iterative procedure. The 
alternative is to estimate values and to adjust them one at a time using the Gauss-Seidel method. 
Both methods converge to the same final estimates, but the speed of convergence differs. The 
EM algorithm was advocated to hasten convergence. The E step of this algorithm finds the 
conditional expectation of the missing values. The M step performs maximum likelihood 
estimation as if there were no missing data. The primary difference between this procedure and 
the regression procedure is that the values for the missing data are not imputed and then iterated. 
The missing values are functions based on the conditional expectation (Little & Rubin, 1987). 
This method of handling missing data represents a fundamental shift in the way of thinking about 
missing data (Schafer & Olsen, 1998). 
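
The E and M steps can be made concrete with a deliberately small case: a bivariate normal model in which x is fully observed and y is missing for some cases. The sketch below is an assumption-laden illustration, not the SPSS routine; the function name, starting values, and data are hypothetical. Note that the E step fills in expected sufficient statistics (y, y squared, and xy) rather than imputing values and iterating on them.

```python
def em_bivariate(x, y, tol=1e-10, max_iter=10000):
    """EM estimates of mean(y), cov(x, y), and var(y) for a bivariate normal
    model where x is fully observed and y has missing values (None)."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values taken from the cases with y observed.
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    m = len(obs)
    mu_y = sum(yi for _, yi in obs) / m
    s_yy = sum((yi - mu_y) ** 2 for _, yi in obs) / m
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in obs) / m
    for _ in range(max_iter):
        b = s_xy / s_xx              # current slope of y on x
        resid_var = s_yy - b * s_xy  # current residual variance of y given x
        # E step: expected sufficient statistics given current parameters.
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + b * (xi - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: maximum likelihood estimates as if the data were complete.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = (abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy)
                  + abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_y, s_xy, s_yy


# Hypothetical data: y is missing for two of the six cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, None, 9.8, None]
mu_y, s_xy, s_yy = em_bivariate(x, y)
print(round(mu_y, 3), round(s_xy, 3), round(s_yy, 3))
```

For this simple missingness pattern the EM estimate of the mean of y should agree with the closed-form value obtained by fitting the complete-case regression and evaluating it at the mean of all x values, which gives a convenient check on the iteration.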

Pattern of Missing Values 

All of the missing data handling procedures discussed require data missing at random 
(MAR) or missing completely at random (MCAR). Yet Cohen and Cohen (1983) suggested that 
in survey research the absence of data on one variable may be related to another variable and may 
be due to the value of the variable itself. When investigating simultaneously missing values, Witta 
(1996/97) found concurrently missing values (p<.001) in three of four samples using data from a 
national database. 

Schafer and Olsen (1998), however, argue convincingly that “every missing-data method 
must make some largely untestable statistical assumptions about the manner in which the missing 
values were lost” (p. 551). Consequently, when analyzing real data, researchers typically assume 
missing at random. 

Procedure 

All high school seniors who had reported working during their senior year of high school 
and for whom base-year and first follow-up data were available were included in this study. The 
initial sample contained the 26 variables used in the Singh and Ozturk study for 4664 subjects. 
These subjects were split into three populations: those containing one or more missing values but 
less than 14 and not having any missing values for standardized test scores (n=504), those 
containing more than 13 missing values (n=19) or missing values on the dependent standardized 
test variables (n=1038), and those containing no missing values on any variable (n=3103). The 19 
subjects having missing values for more than half the variables and the 1038 containing missing 
values for the standardized test scores were eliminated from further analysis. The remaining two 
populations (n=3607) were used to create samples for analysis. 

Creating Test Samples 

A sample containing 500 cases was randomly selected from the non-missing population. 
This target sample was duplicated twice. A sample of 350 cases was randomly selected from the 
missing population. These cases were used to replace an equal number of randomly selected cases 
from one of the target samples. This provided a test sample of 500 with 70% of the cases 
containing missing values. This process was repeated with the second target sample to provide a 
test sample with 50% (250) of the cases containing missing values. The process was repeated 
again with the third target sample to provide a test sample with 30% (150) of the cases containing 
missing values. 
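
The replacement step described above can be sketched as follows. The function name, the seed, and the stand-in case lists are hypothetical; only the sizes (a target sample of 500 and a pool of 504 incomplete cases) follow the text.

```python
import random


def build_test_sample(target, incomplete_pool, proportion, seed=0):
    """Replace a given proportion of target cases with incomplete cases."""
    rng = random.Random(seed)
    n_replace = round(len(target) * proportion)
    replacements = rng.sample(incomplete_pool, n_replace)    # cases to insert
    positions = rng.sample(range(len(target)), n_replace)    # cases to drop
    sample = list(target)
    for pos, case in zip(positions, replacements):
        sample[pos] = case
    return sample


# Hypothetical stand-ins: complete cases hold two numbers,
# incomplete cases contain a None.
complete_cases = [(i, i + 1.0) for i in range(500)]
incomplete_cases = [(i, None) for i in range(504)]

test_70 = build_test_sample(complete_cases, incomplete_cases, 0.70)
n_incomplete = sum(1 for case in test_70 if None in case)
print(len(test_70), n_incomplete)
```

Calling the same function with proportions of 0.50 and 0.30 on copies of the target sample would give the other two test samples of that size.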

This entire procedure was repeated twice to provide test samples with 30%, 50%, and 
70% of the cases containing missing values in test samples of 100 and 300 cases. Thus, 9 test 
samples were created. The missing values of each test sample were treated by each of the four 
missing data handling methods using SPSS 8.0 and SPSS Missing Data Analysis 7.3. 

Analysis 

To answer research question 1, “to investigate the effectiveness of four methods of 
handling missing data using the 26 variables in the Singh and Ozturk (1999) study”, the SPSS 
missing data analysis 7.3 (Hill, 1997) subroutine was used to estimate values for regression and 
the EM algorithm. Each individual standardized test was then regressed on the remaining 
variables (not on other standardized tests) using the data produced by the missing analysis 
procedure and the pairwise and listwise procedures within the regression subroutine of SPSS 8.0. 
Predicted values from each regression were recorded. The mean vectors of the predicted values 
for each missing data method were then contrasted in MANOVA (multivariate analysis of 
variance). 

To answer research question 2, “to compare the effectiveness of the missing data methods 
after creating composite variables”, the mean of the four standardized test scores was used as the 
dependent variable. Composite predictor variables were created by determining the mean of the 
questions forming that construct (see Table A-1). When measurement scales differed, questions 
were converted to z scores prior to determining the mean. 
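
A composite of this kind might be computed as in the sketch below. The item names and values are hypothetical, the population standard deviation is an assumption (the text does not say which divisor was used), and the items are assumed complete, since the text does not state how partially missing items were averaged.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(*items):
    """Average several standardized items case by case."""
    standardized = [z_scores(item) for item in items]
    return [mean(case) for case in zip(*standardized)]


# Two hypothetical items measured on different scales for three cases.
hours_item_a = [10, 20, 30]
hours_item_b = [1, 2, 3]
part_time = composite(hours_item_a, hours_item_b)
print(part_time)
```

Standardizing first keeps an item with a wide scale from dominating the composite.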

After treatment by a missing data method the standardized test score mean was regressed 
on each of the test samples. The predicted standardized test score for each test sample was 
compared to the actual standardized test mean using analysis of variance (ANOVA) with 
Dunnett’s test for comparing all treatments to a control (Howell, 1992) used as a post hoc test. 

To answer research question 3, “to compare the effectiveness of each missing data 
treatment using composite variables to the same treatment when using individual variables”, the 
composite mean standardized test score was regressed on the individual questions after treatment 
by a missing data method. A predicted standardized test score was recorded for each method. The 
predicted score for each missing data method was contrasted with the actual score and with the 
score produced by that method using ANOVA with Dunnett’s and the Tukey post hoc tests. 

Results and Discussion 

Initially data was examined to determine the pattern of missing values as depicted in Table 
A-2. When individual questions were used, data was never missing completely at random. This 
assumption was only violated in one condition (70% incomplete of 300) when composite variables 
were used. As expected and as shown in Figure 1, use of composite variables increased the 
number of complete cases in each condition. 



Insert Figure 1 About Here 



When the mean vectors of the four standardized tests produced by each missing data 
method and the actual mean vector were compared, statistically significant differences were 
detected in three conditions: when 50% of the cases were incomplete with a sample size of 500, 
and when 70% of the cases were incomplete with sample sizes of 300 and 500. These results are 
depicted in Table 1. 



Insert Table 1 About Here 



When 50% of the 500 cases were incomplete, none of the means produced by listwise 
deletion accurately reproduced the target means (see Table A-3). Under these conditions, pairwise 
deletion could not accurately replicate the standardized mathematics mean. All other missing data 
methods adequately reproduced the target means. 

When 70% of the cases were incomplete, the standardized test means produced by listwise 
deletion did not accurately reproduce the target means whenever the sample contained 300 or 500 cases. 
Under these conditions, pairwise deletion adequately reproduced the target standardized reading 
test mean and the target standardized history mean, but not mathematics or science, when the 
sample size was 300, and could not accurately reproduce any of the target means when the 
sample size was 500. The EM algorithm and regression procedures accurately reproduced the 
target sample means under all conditions. It should also be noted that the difference in missing data 
method never explained more than 1% of the variance in mean vectors and the actual difference 
between predicted and actual mean never exceeded 5 points. Thus, in response to research 
question 1, the EM algorithm and regression missing data procedures were more effective in 
reproducing mean vectors than were pairwise or listwise deletion. In fact, both the EM and 
regression procedures produced mean vectors almost identical to the target mean vector. There 
was a reduction, however, in variability as has been noted by other researchers. 

When composite variables were created, there were no statistically significant differences 
in predicting standardized test score based on missing data method under any conditions as shown 
in Table 2. Again, method of handling missing data did not explain more than 1% of the variance 
in standardized test score. Apparently the reduction in the proportion of incomplete cases was beneficial to the 
listwise and pairwise deletion methods. Composite standardized test means for each missing data 
method as well as actual means are included in Table A-4. 
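
The F ratio and the proportion of variance explained can be recovered from the mean squares and degrees of freedom reported in Table 2. The sketch below assumes the usual one-way ANOVA definitions (F as the ratio of mean squares, eta squared as the between-groups sum of squares over the total); the function names are hypothetical, and whether the tabled values were computed in exactly this way is an assumption.

```python
def f_ratio(ms_between, ms_within):
    """F statistic for a one-way ANOVA."""
    return ms_between / ms_within


def eta_squared(ms_between, df_between, ms_within, df_within):
    """Proportion of total variance attributable to group membership."""
    ss_between = ms_between * df_between
    ss_within = ms_within * df_within
    return ss_between / (ss_between + ss_within)


# Values reported in Table 2 for 70% incomplete cases with n = 100.
f_value = f_ratio(31.96, 47.29)
eta2 = eta_squared(31.96, 4, 47.29, 415.0)
print(round(f_value, 3), round(eta2, 2))
```

Under these definitions, the missing data method accounts for well under 1% of the variance in this condition.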



Insert Table 2 About Here 



When the predicted composite standardized test scores (created by regressing composite 
test score on individual questions) produced by each missing data method were contrasted with 
the target composite test score, results were similar to those using the mean vector of each test 
score. As shown in Table 3, statistically significant differences were detected when the sample size 
was 500 with 50% incomplete cases, and when the sample size was 300 or 500 with 70% 
incomplete cases. Whenever these differences were detected, listwise and pairwise deletion were 
significant contributors (see Table A-5). The actual difference between the predicted and actual 
test mean was not more than 5 points. In this instance, however, 2% of the variance in test score 
could be attributed to group. 



Insert Table 3 About Here 



Conclusion 

This study used one sample for each set of conditions. Consequently it is limited in 
generalizability. In addition, there was a relatively large number of variables (26) with a small 
sample size (100). Thus, larger samples may produce different results. Considering these 
limitations, the following conclusions are offered. 

Although statistically significant differences in standardized test scores were detected 
between the missing data method treatments, the variance accounted for by those differences was 
never more than 2%. Use of imputation procedures (EM and regression), however, provides more 
responses and correspondingly higher power. While reduction in variability by the EM and 
regression procedures is troubling, these methods provide greater power and produce more 
accurate estimates of mean vectors. Thus, it is recommended that researchers begin to implement 
these procedures more frequently. 

The use of composite variables produced no differences based on missing data method. 
Because the use of multiple similar variables provides more reliable indicators (although less 
precision) of a construct, this procedure is also recommended. If researchers do not wish to use 
procedures such as the EM algorithm or regression, creating composite variables provides an 
alternative that helps reduce the number of incomplete cases - possibly to an acceptable level. 

Finally, when the proportion of incomplete cases was small (30%), there were no 
statistically significant differences in the performance of the missing data methods. Therefore, if 
the proportion of incomplete cases is small, any procedure will work. The best solution, however, 
is no missing data. 

Further research is needed to investigate more thoroughly the problems associated with 
variability reduction with the EM algorithm and regression procedures. In addition, further 
research is needed using actual data with real patterns of missing values. 

Table 1 

Tests of Statistical Significance Using Individual Questions 

Proportion 
incomplete     n     Wilks' Λ    F         df (hypothesis)    df (error)    Eta² 

30%            100   0.985       0.418     16                 1320.4        <.01 
               300   0.992       0.632     16                 4008.9        <.01 
               500   0.995       0.639     16                 6697.3        <.01 

50%            100   0.963       0.929     16                 1198.2        0.01 
               300   0.984       1.236     16                 3639.2        <.01 
               500   0.981       2.38**    16                 6086.3        0.01 

70%            100   0.976       0.545     16                 1073.0        0.01 
               300   0.965       2.40**    16                 3275.7        0.01 
               500   0.976       2.70**    16                 5472.2        0.01 

Note. *p<.05. **p<.01. 

Table 2 

Tests of Statistical Significance Using Composite Questions 

Proportion 
incomplete     n     MS (between)    MS (within)    F        df (between)    df (within)    Eta Squared 

30%            100   6.18            57.46          0.107    4               455.0          <.01 
               300   20.60           35.54          0.58     4               1391.0         <.01 
               500   30.62           38.02          0.805    4               2152.0         <.01 

50%            100   17.85           51.00          0.35     4               429.0          <.01 
               300   3.28            38.50          0.085    4               1313.0         <.01 
               500   9.42            37.53          0.251    4               2205.0         <.01 

70%            100   31.96           47.29          0.676    4               415.0          0.01 
               300   2.15            42.66          0.05     4               1241.0         <.01 
               500   1.85            41.26          0.045    4               2077.0         <.01 

Note. *p<.05. **p<.01. 

Table 3 

Tests of Statistical Significance Using Individual Questions with Composite Dependent 

Proportion 
incomplete     n     MS (between)    MS (within)    F         df (between)    df (within)    Eta Squared 

30%            100   44.54           66.82          0.62      4               435            <.01 
               300   92.27           41.28          2.33      4               1315           <.01 
               500   89.92           40.4           2.23      4               2195           <.01 

50%            100   119.04          59.35          2.01      4               395            0.02 
               300   101.8           44.97          2.26      4               1194           <.01 
               500   353.2           43.69          8.08**    4               1995           0.02 

70%            100   98.05           69.04          1.42      4               354            0.02 
               300   325.27          52.33          6.22**    4               1075           0.02 
               500   511.73          52.11          9.82**    4               1794           0.02 

Note. *p<.05. **p<.01. 

Appendix 

Table A-1    Composite Variable Questions 
Table A-2    Data Patterns for the Samples Used 
Table A-3    Standardized Test Means by Proportion Incomplete, Sample Size, and Missing 
             Data Method 
Table A-4    Composite Standardized Test Means by Proportion Incomplete, Sample Size, and 
             Missing Data Method 
Table A-5    Composite Standardized Test Means by Proportion Incomplete, Sample Size, and 
             Missing Data Method with Individual Questions as Predictors 

Composite Variable Questions 

Composite Variable    Questions 

Part-time Work        F1S85 HOW MANY HRS DOES R USUALLY WORK A WEEK 
                      F2S88 CURRENT JOB, # HRS WORKED DURING SCHL YR 

Attendance 10         F1S10A HOW MANY TIMES WAS R LATE FOR SCHOOL 
                      F1S10B HOW MANY TIMES DID R CUT/SKIP CLASSES 
                      F1S13 HOW MANY DAYS WAS R ABSENT FROM SCHOOL 

Attendance 12         F2S9A HOW MANY TIMES WAS R LATE FOR SCHOOL 
                      F2S9B HOW MANY TIMES DID R CUT/SKIP CLASSES 
                      F2S9C HOW MANY TIMES DID R MISS SCHOOL 

Participation 10      F1S40A OFTEN GO TO CLASS WITHOUT PENCIL/PAPER 
                      F1S40B OFTEN GO TO CLASS WITHOUT BOOKS 
                      F1S40C OFTEN GO TO CLASS WITHOUT HOMEWORK DONE 

Participation 12      F2S24A GO TO CLASS WITHOUT PENCIL/PAPER 
                      F2S24B GO TO CLASS WITHOUT BOOKS 
                      F2S24C GO TO CLASS WITHOUT HOMEWORK DONE 

Homework 10           F1S36A1 TIME SPENT ON HOMEWORK IN SCHOOL 
                      F1S36A2 TIME SPENT ON HOMEWORK OUT OF SCHOOL 

Homework 12           F2S25F1 TOTAL TIME SPENT ON HMWRK IN SCHOOL 
                      F2S25F2 TOTAL TIME SPENT ON HMWRK OUT SCHL 

Grades 12             F2RHENG2 AVERAGE GRADE IN ENGLISH (HS+B) 
                      F2RHMAG2 AVERAGE GRADE IN MATHEMATICS (HS+B) 
                      F2RHSCG2 AVERAGE GRADE IN SCIENCE (HS+B) 
                      F2RHSOG2 AVERAGE GRADE IN SOCIAL STUDIES (HS+B) 

Standardized Tests    F22XHSTD HISTORY/CIT/GEOG STANDARDIZED SCORE 
                      F22XMSTD MATHEMATICS STANDARDIZED SCORE 
                      F22XRSTD READING STANDARDIZED SCORE 
                      F22XSSTD SCIENCE STANDARDIZED SCORE 



Data Patterns for the Samples Used 